ABSTRACT

Research on protection of statistical databases from revelation of private
or sensitive information [Denning, 1982, ch. 6] has rarely examined
situations where domain-dependent structure exists for a data attribute
such that only a very few independent variables can characterize it. Such
circumstances can lead to Diophantine (integer-solution) equations whose
solution can lead to surprising or compromising inferences on quite large
data populations. In many cases the Diophantine equations are linear,
allowing efficient algorithmic solution. Probabilistic models can also be
used to rank solutions by reasonability, further pruning the search space.
Unfortunately, it is difficult to protect against this form of data
compromise, and all countermeasures have disadvantages.



1. Two problems 

Consider a university personnel database, and the set of salaries of faculty.
Suppose there are only three ranks (assistant, associate, and full professor)
with salary the same for all members of a rank. Suppose we know from reading
the catalogue the number of faculty at each rank, and suppose we know from the
annual financial report the total amount of salary paid to professors (or
equivalently, the mean for the institution). Then we can write a linear
Diophantine (integer-solution) equation in three variables, and solve for the
salary associated with each rank. We will in general obtain a finite set of
possible values for each salary, which can be pruned if we know additional
information such as reasonable limits on faculty salaries or the restriction
that they be multiples of one thousand dollars.

There is a kind of dual problem to this one. Suppose we know the total tonnage
of British naval vessels in the South Atlantic, and suppose we also know from
published sources the tonnages of the only types of ships owned by the British
navy. We can then write a linear Diophantine equation and solve for all
possible fleet compositions. If we know information such as bounds on the total
number of ships in the British fleet we can narrow the possibilities further.
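
The fleet example can be sketched as a brute-force enumeration. This is only an
illustration under stated assumptions: the function name `solutions`, the ship
tonnages (3, 4, and 20 thousand tons), the total (47 thousand tons), and the
bound of six ships are all hypothetical, not figures from any real source.

```python
def solutions(coeffs, total, upper=None):
    """Yield every vector x of nonnegative integers with sum(c * x) == total.

    coeffs: positive integer coefficients (here, hypothetical ship tonnages).
    upper:  optional per-variable upper bounds (None means unbounded).
    """
    if upper is None:
        upper = [None] * len(coeffs)
    if not coeffs:
        if total == 0:
            yield ()
        return
    c, cap = coeffs[0], upper[0]
    limit = total // c if cap is None else min(cap, total // c)
    for x in range(limit + 1):
        # Fix this variable, then solve the smaller equation for the rest.
        for rest in solutions(coeffs[1:], total - c * x, upper[1:]):
            yield (x,) + rest


# Hypothetical unknown-counts problem: three ship types, known total tonnage.
tonnages = (3, 4, 20)
fleets = list(solutions(tonnages, 47))

# An extra inequality constraint: at most six ships in total.
small_fleets = [f for f in fleets if sum(f) <= 6]
```

With these made-up numbers there are eight possible fleet compositions, and
the bound on the number of ships leaves only (1, 1, 2). The same routine serves
the unknown-values problem if the known counts are used as the coefficients and
the values are the unknowns.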

These two examples represent what we call the "unknown-values" and
"unknown-counts" Diophantine problems, respectively. (Or "Diophantine
compromises", but the latter word is mostly used for individual data-item value
revelations, and the inferences here are about sets, though occasionally sets of
size 1.) They arise whenever the following conditions all hold in regard to some
attribute A and some set S:

1. A is numeric 

2. A has only a small number of distinct values for the set S 

3. we know the sum of all the values of A for S (or equivalently, the mean of
the values and the size of S)

4. we either know (a) the number of items having each value of A for this set
(the unknown-values problem), or (b) the exact values that do occur for A
for this set (the unknown-counts problem)

5. the unknowns are drawn from a finite universe having not "too many" 
members. 

By "small" in (2) we mean on the order of 10 or less, and by "too many" in (5)
we mean on the order of 1000 or more. Situations that satisfy these
restrictions arise usually with "artificial" attributes representing invented
codes and measured properties of man-made objects. Often they involve joins,
either explicit or implicit, of a small relation with unique-valued attributes
representing fixed properties of objects with a larger relation representing
relationships or activities of those objects.

Mathematically:

N1 * V1 + N2 * V2 + N3 * V3 + ... = S

where the V's are the possible data values, the N's the number of occurrences
of each, and S the total. In order to ensure that this equation is Diophantine,
since the V's may be rational, we should divide both sides by the greatest
common divisor of the V's.



If rounding and/or truncation is used in calculating S so that the result is not
exactly the sum (as may significantly occur with totals of large sets), we can
often infer the true sum since the greatest common divisor of the V's divides
the right side an integer number of times. We round S to the nearest integer
multiple of that divisor.
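
A minimal sketch of this scaling and rounding step, using exact rational
arithmetic. The helper names `rational_gcd` and `round_to_multiple` and the
sample values are assumptions made for illustration; values are passed as
strings so that decimal inputs are not distorted by binary floating point.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def rational_gcd(values):
    """Largest positive rational g such that every value is an integer multiple of g."""
    fracs = [Fraction(v) for v in values]
    # Common denominator: least common multiple of all denominators.
    denom = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs))
    # After scaling by the common denominator every value is an integer.
    numer = reduce(gcd, (int(f * denom) for f in fracs))
    return Fraction(numer, denom)


def round_to_multiple(total, g):
    """Nearest integer multiple of g to a rounded or truncated total."""
    return round(Fraction(total) / g) * g


values = ["1.5", "2.25", "6"]              # hypothetical data values V
g = rational_gcd(values)                   # Fraction(3, 4)
coeffs = [int(Fraction(v) / g) for v in values]   # integer coefficients
true_sum = round_to_multiple("100.4", g)   # reported total 100.4 -> 100.5
```

Dividing the values and the recovered sum by g gives an equation with integer
coefficients (here 2, 3, and 8) and an integer right side.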



2. Multiple constraints 

Knowledge of the sum on some attribute is a linear equality constraint, using
the terminology of optimization. There are many additional ways of obtaining
both equality and inequality constraints on the Diophantine solution, making
protective countermeasures for the database difficult. Of course, if we obtain
as many independent equations as variables we can often determine them
uniquely. But even when we have more unknowns than equations the Diophantine
(integer) restriction itself can narrow the possibilities to a small number.

We can briefly summarize the categories of additional equality constraints that
may be available (see [Rowe, in press] for more details of each except the last
two):

1. additional moments on the data (e.g. standard deviation), which give
linear Diophantine equations for the unknown-counts case, nonlinear for the
unknown-values. The sum of the values can be considered the zeroth
moment.

2. corresponding moments on attributes that are in one-to-one relationship
to the attribute of interest (i.e. that show "extensional" functional depen- 
dencies in both directions, functional dependencies true only for a particu- 
lar database state), and hence have the same frequency statistics. These 
give additional linear Diophantine equations. 

3. a generalization of the preceding, corresponding statistics on attributes 
that have an extensional functional dependency in only one direction, to or 
from the attribute of interest. These give linear Diophantine equations on 
new variables that relate by sums to the old. 

4. different factorizations of the same data, as in multidimensional con- 
tingency tables of sums (these give linear equations) 

5. statistics on unions of sets of interest, whether directly or indirectly by 
inserts to the database (these give linear equations) 

6. If type checking is not enforced, moment calculation routines can be 
applied to few-valued nonnumeric attributes, giving equations like those of 
item 3 (giving linear equations). 

7. If transformations of data values (e.g. logarithm, square root, reciprocal)
can be applied before computing moments, Diophantine equations of the
same form but generally different coefficients are generated. These are 
linear for the unknown-counts problem, nonlinear for the unknown-values. 

8. Even multi-argument arithmetic operations on data values can sometimes
be exploited. For instance, knowing the possible values for two attributes
allows calculation of the possible values of their product, which if the
former are integers are not uniformly spaced. Another example is
knowledge of the proportion of items having a certain property in a set,
when the size of the set is not known. This is a "Diophantine approximation"
problem where we must find an a and b such that a/b is closer than some
small error E to some proportion p. All solutions form series such that if
a/b is a solution, then so is (k*a)/(k*b) for any positive integer k. For a
random proportion, the number of solution series follows a binomial
distribution, with the simplest solution requiring on the average a
denominator of the square root of 2/E.

9. Joins can be used to create few-valued attributes as mentioned in the last 
section, but joins can also be used to get many additional equality con- 
straints on a set. If we know the mean of an attribute of a relation, we can 
compare it to the mean of the same attribute after the relation has been
joined with another on some other join attribute(s). We can take subrela- 
tions of the second relation, or use different second relations, or join on 
different attributes, to get a variety of different equations. The equations 
are (surprisingly) linear for the unknown-values problem, but nonlinear for
the unknown-counts problem. 
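
The proportion problem in item 8 can be sketched as a direct search over
denominators. This is an illustrative sketch only: the function name
`approximations`, the cap `max_denominator`, and the sample numbers are
assumptions, and each solution series is represented by its fraction in lowest
terms.

```python
from fractions import Fraction
from math import ceil, floor, gcd


def approximations(p, err, max_denominator):
    """Fractions a/b in lowest terms, 0 <= a <= b <= max_denominator,
    with |a/b - p| strictly less than err."""
    p, err = Fraction(p), Fraction(err)
    found = []
    for b in range(1, max_denominator + 1):
        a_min = floor((p - err) * b) + 1   # smallest a with a/b > p - err
        a_max = ceil((p + err) * b) - 1    # largest a with a/b < p + err
        for a in range(max(a_min, 0), min(a_max, b) + 1):
            if gcd(a, b) == 1:             # keep one representative per series
                found.append((a, b))
    return found


# Hypothetical reported proportion 0.333, known to within 0.01.
candidates = approximations("0.333", "0.01", 10)
```

For this example the only candidate with denominator at most 10 is 1/3, so the
set size must be a multiple of 3 if it is known to be that small.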



Additional inequality constraints can also prune the solution possibilities for
a set of Diophantine equations (again see [Rowe, in press] for more details):

1. bounds on the frequency distribution of the values for the unknown-counts
problem, such as the mode frequency or the frequency of the least
common item

2. bounds on the values for the unknown-values problem, such as absolute
maxima and minima that it is impossible for values to go beyond

3. medians and other order statistics on the set, which state how many
items can lie in a certain value range (useful for both unknown-counts and
unknown-values)

4. the number of items having certain values in any superset containing the 
set of interest (needed for both). Often we know the number of items having 
particular values in the entire database, and we can also sometimes per- 
form an easier Diophantine analysis (because of number-theoretic peculiar- 
ities) on a superset. 



3. Solving the equations 

We can take all the constraints found by the methods of the last section and 
find a finite set of possible values for each variable by a variety of methods. 
Fortunately, most of the abovementioned equality constraints are linear, and
there exist sophisticated methods covered in [Chou and Collins, 1982] for
finding solution vectors of matched variable values for this information. We
can then apply inequality constraints successively, filtering out those vectors
inconsistent with them. The possible values of the kth variable are then just 
the possible kth vector components. 
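
As a much simpler stand-in for the cited methods for systems of equations, the
following sketch handles a single equation in two unknowns with the extended
Euclidean algorithm, then applies inequality constraints (here, nonnegativity)
to the one-parameter family of solutions. The function names are assumptions
made for illustration.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with a*s + b*t == g == gcd(a, b)."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def nonnegative_solutions(a, b, c):
    """All (x, y) with x, y >= 0 and a*x + b*y == c, for positive integers a, b."""
    g, s, t = extended_gcd(a, b)
    if c % g:
        return []                      # no integer solution at all
    x0, y0 = s * (c // g), t * (c // g)
    step_x, step_y = b // g, a // g
    # General solution: x = x0 + step_x * k, y = y0 - step_y * k.
    k_min = -(x0 // step_x)            # smallest k keeping x >= 0
    k_max = y0 // step_y               # largest k keeping y >= 0
    return [(x0 + step_x * k, y0 - step_y * k) for k in range(k_min, k_max + 1)]


vectors = nonnegative_solutions(2, 3, 12)
# A further inequality constraint filters the vectors; projecting onto the
# first component then gives the possible values of that variable.
possible_x = sorted({x for x, y in vectors if x >= y})
```

Here `vectors` is [(0, 4), (3, 2), (6, 0)] and the extra constraint x >= y
leaves 3 and 6 as the possible values of x.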

But our goal of finding all possible values for a variable is less general, and
we can take some shortcuts in the above approach. In particular, we can filter
out many possibilities a priori with the following two rules:

1. In the Diophantine equation C1 * X1 + C2 * X2 + C3 * X3 + ... = S, where
the C's are constants and X's are unknowns, find the pair of C's which are
relatively prime (if any) and have the smallest product, and call them CJ
and CK. Then for any other term I, XI can take any integer value from 0 to
(S - (CJ * CK) + CJ + CK - 1) / CI. (This says nothing about larger values for
XI.) This follows from the number theory result that for relatively prime A
and B and any N > A*B - A - B, there exist some nonnegative integers X and Y
such that A*X + B*Y = N.

2. In the Diophantine equation C1 * X1 + C2 * X2 + C3 * X3 + ... = S, if for
some I, CJ is divisible by some integer F > 1 for all J not equal to I, then
(CI * XI) mod F = S mod F for any solution value XI.
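
The two rules can be sketched as small helper functions. This is an
illustration, not a definitive implementation: the function names are
hypothetical, rule 2 is read as a divisibility condition on the coefficients
other than the one of interest, and F is taken to be their greatest common
divisor.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def rule1_upper_bound(coeffs, total, i):
    """Largest value of X_i that rule 1 guarantees to be feasible, or None
    if no two of the other coefficients are relatively prime."""
    others = [c for j, c in enumerate(coeffs) if j != i]
    pairs = [(a * b, a, b) for a, b in combinations(others, 2) if gcd(a, b) == 1]
    if not pairs:
        return None
    _, a, b = min(pairs)               # coprime pair with the smallest product
    return (total - a * b + a + b - 1) // coeffs[i]


def rule2_residues(coeffs, total, i):
    """Return (F, residues): X_i mod F must be one of the residues, where F is
    the gcd of the other coefficients. Returns None if F == 1 (no information)."""
    others = [c for j, c in enumerate(coeffs) if j != i]
    f = reduce(gcd, others)
    if f <= 1:
        return None
    return f, [r for r in range(f) if (coeffs[i] * r - total) % f == 0]


bound = rule1_upper_bound((13, 7, 20), 1200, 0)   # 83
residues = rule2_residues((13, 7, 21), 1200, 0)   # (7, [4])
```

Rule 1 yields a range of values that are certainly possible, while rule 2
yields a congruence that every solution must satisfy, so the two are applied
for different purposes.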

Various methods can be used for nonlinear Diophantine equations; see [Mordell,
1969]. General algorithms do not exist, but there are many special-purpose
tools (e.g. analysis in modular arithmetic). An exhaustive combinatorial
search can be fallen back on, since one can almost always find absolute bounds
on the integer unknowns.

A serious problem with protecting databases from too-powerful Diophantine
inferences is that they vary considerably in effectiveness, and it is difficult
to generalize about their power. This is because they are based on number
theory, and are very sensitive to the lower-order bits of all given numbers. To
return to our first example of the paper, if there are 13 assistant professors,
7 associate, and 20 full, and their salary sum is $1,200,000, and we assume
possible salaries are multiples of $1000, then we can use rule 1 above and say
that assistants can have any salary from 0 to (1200 - 140 + 20 + 7 - 1) / 13 =
83.5 thousands, associates (1200 - 260 + 13 + 20 - 1) / 7 = 138.9 thousands,
fulls (1200 - 91 + 13 + 7 - 1) / 20 = 56.4 thousands. But if a new full
professor is added to the faculty, then all of a sudden we can apply rule 2
above with F = 7 (the counts 7 and 21 of the other two ranks are both divisible
by 7) and find that (13 * X) mod 7 = (-X) mod 7 = 1200 mod 7 = 3, X mod 7 = 4,
and hence the only values possible for assistant professor salaries are 4, 11,
18, 25, 32, 39, 46, 53, 60, 67, 74, and 81 thousands. Bounds can rule out
possibilities, so if say we know the range is $24,000 to $31,000, we can infer
a unique value of $25,000, whereas with those same bounds before the professor
was added we had 8 solutions.
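
The professor example can be checked by exhaustive search. The sketch below
works in thousands of dollars and takes the salary sum as 1200 both before and
after the new full professor is added, as the example does; the function names
`feasible` and `possible_values` are assumptions made for illustration.

```python
def feasible(coeffs, total):
    """True if sum(c * x) == total has a solution in nonnegative integers x."""
    if total < 0:
        return False
    if not coeffs:
        return total == 0
    if len(coeffs) == 1:
        return total % coeffs[0] == 0
    c, rest = coeffs[0], coeffs[1:]
    return any(feasible(rest, total - c * x) for x in range(total // c + 1))


def possible_values(coeffs, total, i, lo, hi):
    """Values in [lo, hi] for unknown i that leave the rest of the equation solvable."""
    others = coeffs[:i] + coeffs[i + 1:]
    return [x for x in range(lo, hi + 1) if feasible(others, total - coeffs[i] * x)]


# Coefficients are the faculty counts; unknowns are salaries in thousands.
before = possible_values((13, 7, 20), 1200, 0, 24, 31)   # 13, 7, 20 faculty
after = possible_values((13, 7, 21), 1200, 0, 24, 31)    # one more full professor
```

With the bounds of 24 to 31 thousands, `before` contains all eight values and
`after` contains only 25, in agreement with the text.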

The fact that the number of solutions to a Diophantine problem can vary so
widely (and even more so with nonlinear Diophantine problems) means that
someone wishing to intentionally facilitate unwarranted inferences (or a user
with insert capability wishing to compromise) could choose a set of values or
counts, perhaps just by fudging of true values, that could help enormously.
And note an "easy" set of values or counts remains "easy" for any right-hand
side constant, though constraints affect the actual number of solutions.
Unfortunately, while some "easy" sets of values are apparent on inspection,
others are not.



4. Ranking solutions

We can often do more than just obtain possibilities consistent with constraints.
We can rank possibilities by reasonableness, perhaps assigning probabilities to
quantify it. For instance, if an attribute represents the number of children in
an employee's family, the value 10 is possible, but unlikely; so a solution for
the frequency of values in a subset that has half of the employees with 10
children is highly unlikely.

A good general-purpose way of obtaining ranks for the unknown-counts problem
is possible when the number of items having each particular value is
known for the database as a whole. Then the distribution of values in any
subset of the database can be modelled by a multinomial distribution if we
assume independence of value occurrence, with probabilities equal to
frequencies of values in the full database. Of course, knowledge of a
particular database domain may suggest other "nonlinear" ranking methods which
may supersede this. For example, for our professor data we may think that the
difference between full and associate is probably pretty close to the
difference between associate and assistant, and unlikely to be three or four
times, or one third or one fourth.
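
The multinomial ranking can be sketched as follows. The function names, the
database-wide frequencies (0.5, 0.3, 0.2), and the candidate count vectors are
hypothetical; the candidates are the three solutions of an unknown-counts
problem with values 1, 2, 3, set size 5, and sum 10.

```python
from math import factorial, prod


def multinomial_probability(counts, probs):
    """Probability of exactly these counts in sum(counts) independent draws,
    where probs[i] is the database-wide frequency of value i."""
    coefficient = factorial(sum(counts))
    for n in counts:
        coefficient //= factorial(n)   # exact: each partial quotient is an integer
    return coefficient * prod(p ** n for p, n in zip(probs, counts))


def rank_solutions(candidates, probs):
    """Sort candidate count vectors from most to least probable."""
    return sorted(candidates,
                  key=lambda c: multinomial_probability(c, probs),
                  reverse=True)


probs = (0.5, 0.3, 0.2)                          # assumed frequencies in the full database
candidates = [(0, 5, 0), (1, 3, 1), (2, 1, 2)]   # solutions of the Diophantine problem
ranked = rank_solutions(candidates, probs)
```

Under these assumed frequencies the ranking is (2, 1, 2), then (1, 3, 1), then
(0, 5, 0), whose probabilities are about 0.09, 0.054, and 0.0024.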



5. Multivariable-dependent attributes 

Thus far we have required attributes with relatively few distinct numeric
values. There is an important generalization to attributes with perhaps many
values, but values all determined by a few independent variables. Consider the
salary policy for most employees of Stanford University, roughly modelable as
a logarithm of years of service, starting from a certain "level". Thus there can
be many different employee salaries, but they can be explained by one of ten or
so values for "level" and a variable for number of years of service (i.e., there
is a functional dependency from level and years to salary). We can write an
equation:



SUM over i of [ Li * log(K * Yi) ] = S

which we can make Diophantine by dividing by the greatest common divisor of
the left hand side. So if we know the levels and years for any subset containing
all levels, and its salary sum, we can solve for possibilities for the Li and K.
If we know more subsets, we can narrow the possibilities further.



6. Countermeasures 

Good countermeasures against these inferences are hard. All methods have
serious drawbacks.

Protection by limiting statistics computed on a set of data is one possibility,
but may require computationally expensive analysis, since nothing less than
attempting to solve every possible Diophantine situation in advance will do.
Fortunately, however, the checklist given in section 1 is not satisfied very
often, primarily because few-valued numeric attributes are rare. But they do
arise from time to time, and when they do, little short of comprehensive
threat analysis will do. Note that query overlap controls used to protect
against a variety of classical compromise methods [Denning, 1982, ch. 6] are
useless here because strong inferences can be derived merely from different
queries on the same set, or even sometimes a single query. Controls that
suppress statistics on particularly small sets are some help, but since the
power of Diophantine methods is highly sensitive to the lower-order bits of
coefficients, this doesn't help very much.

A better form of protection seems to be perturbation of the data or query
output, since this can affect low-order bits severely. The perturbations have
to be random, or the user might be able to discover the perturbation by
experiment and possibility elimination, and they have to be pseudo-random as
opposed to truly random, or the user might zero in on true values by asking the
same query repeatedly. The perturbations must also be sufficiently large that
the user cannot just specify a small range of true values for statistics, given
the perturbed values, and intersect all the results obtained from solving
separately a Diophantine equation set for each possibility. Some of these
equation sets may be immediately ruled out as impossible (e.g. anything with
4x + 10y + 10z = 101 because the sum of even numbers can't be an odd number),
while the equation set corresponding to the true values of the means and
moments is always guaranteed to have a solution, the solution corresponding to
the true state of the world.
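
The immediate elimination of impossible right-hand sides can be sketched with
a divisibility test. This is only a necessary condition for solvability, not a
full solution of each equation set, and the function name `surviving_totals`
and the error bound of 2 are assumptions made for illustration.

```python
from functools import reduce
from math import gcd


def surviving_totals(coeffs, reported, max_error):
    """Candidate true totals within max_error of the reported (perturbed) total
    that pass the test gcd(coeffs) divides total. Passing is necessary for a
    solution to exist, but does not guarantee one in nonnegative integers."""
    g = reduce(gcd, coeffs)
    return [s for s in range(reported - max_error, reported + max_error + 1)
            if s % g == 0]


# Coefficients 4, 10, 10 as in the text; reported total 101, perturbed by at most 2.
candidates = surviving_totals((4, 10, 10), 101, 2)
```

Of the five totals from 99 to 103, only 100 and 102 survive, so a small
perturbation leaves the user with few equation sets to solve.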

A certain amount of query output perturbation may occur automatically in a
database system due to rounding and/or truncation of statistics calculated on
large sets. However, this perturbation may be quite systematic as opposed to
random, and is likely to vary considerably in magnitude with the sizes of the
sets being analyzed as well as the values themselves, and thus rarely can
offer certain protection.

A curious aspect of Diophantine inferences is that they can still be possible
even when the data is encrypted and statistical aggregates cannot be
calculated directly. Assuming the encryption is on each data attribute value
and not on blocks of attributes, the one-to-one nature of the encryption
process which is necessary for recoverability makes encrypted data functionally
dependent in both directions on the original data; hence we can use the methods
listed under equality constraint 2 of section 2, provided we know the mean of
the encrypted values, as we may quite easily in a public-key system. The
solution is only a frequency distribution, and does not give correspondences
between encrypted values and true values, but often many of these can be
identified by inspection and methods similar to the solution of simple
substitution ciphers by tables of English letter frequencies.



7. Conclusion 

Diophantine inferences can pose an important threat to the confidentiality of
certain kinds of data. Their power in specific cases is difficult to categorize
short of detailed number-theoretic analysis. Protection measures involving
withholding of statistics are weak in effect, and protection involving data or
answer perturbation seems to be the only real possibility, with its concomitant
disadvantages of degrading statistic quality. As best as we can tell, however,
no publishers of summary statistics have addressed this type of compromise,
including census agencies which have been concerned with other types [Cox,
1980; Sande, 1983]. It seems important that they become concerned.